The Ronaldo Effect!!!¶


Introduction¶


Cristiano Ronaldo dos Santos Aveiro

  • One of the best forwards in world football, Cristiano Ronaldo began playing for the Portuguese team Sporting CP at age 17 and signed for Manchester United a year later. He has also played for the Spanish club Real Madrid and the Italian club Juventus. He is a Portuguese professional footballer who plays as a forward for Premier League club Manchester United and captains the Portugal national team.
  • He currently holds the record for the most goals scored in men's international football.

  • Current team: Portugal national football team (#7 / Forward)

  • Born: February 5, 1985 (age 37 years), Hospital Dr. Nélio Mendonça, Funchal, Portugal

  • Height: 1.87 m
  • Partner: Georgina Rodríguez (2017–)
  • Salary: 26.52 million GBP (2022)
  • Children: Cristiano Ronaldo Jr., Alana Martina dos Santos Aveiro, Eva Maria Dos Santos, Mateo Ronaldo

Manchester United

  • One of the greatest football clubs in the history of the sport, Manchester United competes in the Premier League. The club holds the record of 13 Premier League titles.

The Premier League:

  • The Premier League is the top tier of England's football pyramid, with 20 teams battling it out for the honour of being crowned English champions.

  • It is also the most-watched league on the planet with one billion homes watching the action in 188 countries. A home to some of the most famous clubs, players, managers and stadiums in world football, one of them being Manchester United.

Football

  • Football is a two-team sport with a maximum of 11 players per side on the pitch. Each game consists of two 45-minute halves separated by a 15-minute break. A football match, however, does not always finish in exactly 90 minutes: time lost to substitutions, injuries, and disciplinary actions is added on, so it usually lasts a little longer.

  • One of the most thrilling positions in football is forward. The goal of the forward is to score and to put pressure on their opponents to make errors. In addition, great forward players need good dribbling, shooting and heading skills.

Objectives¶


Soccer Magazine

As a group of data analysts, we have been hired by the 'World Soccer' sports magazine. Amid the thrilling matches of the FIFA World Cup, the management wants us to release a detailed report on Cristiano Ronaldo using data driven insights to attract and engage the audience.

We have tried to put together the whole story by connecting different parts of his career and drilling down into his statistics for the different teams he has played for. Our main objectives are:
  • How has Manchester United performed over the last 20 years, and how much did Ronaldo contribute?
  • Determine how playing at the home ground affected Manchester United's winning probability during the period when Ronaldo played for them versus the period when he did not.
  • We are also interested in analysing how Ronaldo's performance boosted Manchester United's winning probability when the club was playing at its home ground. We will compare the percentages for 2003-2009 against the club's remaining seasons as a metric to add value.
  • What does the trajectory of Ronaldo's career look like?
  • Drilling down into one of his best seasons, i.e. 2017/18 for Real Madrid, we try to analyse the pattern of how he scores goals.
These questions help us to understand Ronaldo's career as a whole and how he helped Manchester United club to dominate the football world. We will be using winning probability at the home ground as a metric to assess his impact on Manchester United. This report will immensely help Manchester United's new business committee to make strategic decisions while acquiring players in the future. For sure, there will be no other player like him, but there is always a question of who has the potential to replace him after his retirement.

Datasets and Importing Libraries¶


In [9]:
#Library for data analysis and processing

import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import regex as re
import matplotlib.pyplot as plt
from matplotlib.cbook import get_sample_data
from matplotlib.offsetbox import (OffsetImage, AnnotationBbox)


import os

#Plotly Library to make graphs
import plotly.io as pio
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.express as px



#Library to be used for reading json file

import json

#Bokeh Library to make graphs
import bokeh
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.models.widgets import DataTable, DateFormatter, TableColumn

#Library to display images
from IPython.display import Image

#Use the output_notebook() function to display Bokeh plots in Jupyter notebook
output_notebook()

#Library to ignore warnings
import warnings
warnings.filterwarnings('ignore')

Datasets¶

Detailed description of the data set(s) and how they were acquired:

1. Premier league matches Dataset

https://www.kaggle.com/datasets/evangower/premier-league-matches-19922022

This dataset acquired from Kaggle includes every game ever played in the English Premier League, starting in 1992 and continuing through the last week of the 2021–2022 season. Each season lasts for 1 year, and a total of 20 teams compete in the competition. It gives us the details about each match:

  • Where it took place
  • How many goals were scored
  • Who won the match etc.
2. Manchester United Club's achievements Dataset

https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons

This webpage lists the details of the Manchester United club's achievements in major competitions for all years from its inception. We will use information in the 'Results of league and cup competitions by season' table to identify in which seasons the club has won the premier league title.

3. Match Events Dataset - Player, Time, Outcome, Position details

https://www.nature.com/articles/s41597-019-0247-7

A public dataset that contains the spatio-temporal events that occurred during football matches. Each match event carries information about its position, time, outcome, player and characteristics. We downloaded JSON files, which were read and converted to pandas dataframes for analysis.

4. Ronaldo's Club goals dataset

https://www.kaggle.com/datasets/azminetoushikwasi/cr7-cristiano-ronaldo-all-club-goals-stats

This dataset acquired from Kaggle consists of the complete list of all club goals of Ronaldo. CSV file was used for processing and analysis.

Project should be graded more heavily on - ¶

As mentioned above, we have used 6 datasets in different file formats (CSV and JSON). We have also collected data from a publicly available website, Wikipedia. We followed a series of steps to pre-process the data and get each DataFrame ready for analysis.

Manchester United in Last 20 years - We analyze how the club maintained consistency by winning the league title three times in a row. We have analyzed the "Season" and "Standing" columns for this purpose.

Home ground Analysis at Manchester United - Here we have analyzed the winning probability of Manchester United on home ground when Ronaldo was playing at the club v/s when Ronaldo was not playing with the club.

Ronaldo Goals Scoring patterns - For the La Liga dataset, we merged data from 2 different dataframes and analyzed it to find patterns: from which positions Ronaldo attempted a goal and when the attempt was successful. The analysis included

  • Finding tag information (whether an event was a goal or an assist)
  • Analyzing event and match information from 2 datasets.
  • Getting appropriate x and y co-ordinates for plotting the positions.
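The coordinate step in the list above can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes that each event position is given as a percentage (0 to 100) of the pitch length (x) and width (y), and that the pitch measures 104 by 68 units. The names `PITCH_LENGTH`, `PITCH_WIDTH` and `to_pitch_coordinates` are hypothetical, and whether the y axis needs flipping depends on how the pitch is drawn.

```python
# Assumption: positions are percentages (0-100) of pitch length (x) and width (y).
PITCH_LENGTH = 104  # hypothetical constant, same units as the drawn pitch
PITCH_WIDTH = 68    # hypothetical constant


def to_pitch_coordinates(position, flip_y=False):
    """Convert one {'x': ..., 'y': ...} percentage position to pitch units."""
    x = position['x'] / 100 * PITCH_LENGTH
    # Optionally mirror y if the data measures it from the opposite touchline
    y_pct = 100 - position['y'] if flip_y else position['y']
    y = y_pct / 100 * PITCH_WIDTH
    return x, y


# The centre spot (50%, 50%) maps to the middle of the pitch
print(to_pitch_coordinates({'y': 50, 'x': 50}))
```

Keeping the conversion in one small function makes it easy to apply to the first entry of each event's `positions` list before plotting.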

Objective I - Manchester United in last 20 years¶

In [10]:
# Importing the Manchester United season's data from wikipedia

man_united_seasons = pd.read_html('https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons#Seasons', header=0,match='Season')[0]

# Display the dataframe

man_united_seasons.head()
Out[10]:
Season League League.1 League.2 League.3 League.4 League.5 League.6 League.7 League.8 League.9 FA Cup EFL Cup CommunityShield UEFAFIFA Top goalscorer(s)[a] Top goalscorer(s)[a].1
0 Season Division Tier Pld W D L GF GA Pts Pos FA Cup EFL Cup CommunityShield UEFAFIFA Name(s) Goals
1 1886–87[b] — NaN — — — — — — — — R1 NaN NaN NaN Jack Doughty 4
2 1888–89[c] Combination NaN 12 8 2 2 27 13 18 — — NaN NaN NaN Jack DoughtyRoger Doughty 6
3 1889–90 Alliance NaN 22 9 2 11 40 45 20 8th R1 NaN NaN NaN Willie Stewart 10
4 1890–91 Alliance NaN 22 7 3 12 37 55 17 9th QR2 NaN NaN NaN Bob Ramsay 7

Data Cleaning¶

From this dataframe we need only two columns "Season" and "Pos"

  • Season: part of the year during which football matches were held.
  • Pos: ranking in all of the competitions/leagues Manchester United has participated in.

Let us rename Pos column to Standings as the values denote where the club is placed in rankings in that particular season.

Drop, Rename, Duplicates and Null Values¶

In [11]:
# Making the 1st row as the header as current header is not required

man_united_seasons.columns = man_united_seasons.iloc[0]

# Drop all the columns except Season and Pos and first row

man_united_seasons= man_united_seasons[['Season','Pos']]
man_united_seasons = man_united_seasons.iloc[1: , :]

# Renaming the Pos column to Standing

man_united_seasons.columns = ['Season','Standing']

# Check for duplicates in the dataset

duplicates = man_united_seasons.duplicated().sum()

print(f'Number of duplicates in the dataset are: {duplicates}')

# Identifying the null values

null_values = man_united_seasons.isnull().sum().sum()

print(f'Number of null values in the dataset are: {null_values}')


# Display the dataframe

man_united_seasons.head()
Number of duplicates in the dataset are: 0
Number of null values in the dataset are: 0
Out[11]:
Season Standing
1 1886–87[b] —
2 1888–89[c] —
3 1889–90 8th
4 1890–91 9th
5 1891–92 2nd[d]

Here, we checked for duplicates and null values in the cleaned dataframe. Taking the sum of the booleans returned by the duplicated and isnull functions shows that there are no duplicates or nulls. Let us proceed with the analysis:
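To illustrate why summing booleans gives these counts, here is a small sketch on a made-up dataframe (the rows below are toy values, not the Wikipedia data). `duplicated()` marks every repeated row as True and `isnull()` marks every missing cell as True, so summing them counts the offending rows and cells.

```python
import numpy as np
import pandas as pd

# Toy data: one repeated row and one missing value (illustrative only)
toy = pd.DataFrame({
    'Season': ['1889–90', '1890–91', '1890–91', '1891–92'],
    'Standing': ['8th', '9th', '9th', np.nan],
})

# True counts as 1, so the sum is the number of repeated rows
n_duplicates = int(toy.duplicated().sum())

# First sum is per column, second sum adds the columns together
n_nulls = int(toy.isnull().sum().sum())

print(f'Number of duplicates: {n_duplicates}')
print(f'Number of null values: {n_nulls}')
```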

Formatting Data using Regex¶

In [12]:
# Update Season column with the four digits of the year in which the season starts
# Note: a season such as 2000-2001 is therefore represented by the year 2000.

pattern2 = r'(\d{4})'
man_united_seasons['Season'] = man_united_seasons['Season'].str.extract(pattern2, expand=False)

# Update Standing column with the one- or two-digit rank (e.g. '2nd[d]' becomes '2')

pattern1 = r'(\d{1,2})'
man_united_seasons['Standing'] = man_united_seasons['Standing'].str.extract(pattern1, expand=False)

# Drop nulls if present (seasons without a numeric league position)

man_united_seasons = man_united_seasons.dropna()

# Convert the column values to integer

man_united_seasons['Season'] = man_united_seasons['Season'].astype(int)
man_united_seasons['Standing'] = man_united_seasons['Standing'].astype(int)
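As a quick check of what the two patterns capture, the sketch below applies them to three rows copied from the table shown earlier. It is a standalone illustration with its own small dataframe (`raw` and `cleaned` are names introduced here). Note that the four-digit pattern keeps the year in which a season starts.

```python
import pandas as pd

# Three rows taken from the table shown earlier; '\u2014' is the dash placeholder
raw = pd.DataFrame({
    'Season': ['1886–87[b]', '1889–90', '1891–92'],
    'Standing': ['\u2014', '8th', '2nd[d]'],
})

# The first run of four digits is the year in which the season starts
raw['Season'] = raw['Season'].str.extract(r'(\d{4})', expand=False)

# The first one or two digits are the rank; rows with no digits become NaN
raw['Standing'] = raw['Standing'].str.extract(r'(\d{1,2})', expand=False)

# Rows without a numeric standing are dropped before converting to integers
cleaned = raw.dropna().astype(int)
print(cleaned)
```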

Let's find out Manchester United's football club standings over recent years:

In [13]:
# Taking only the Manchester United seasons that start after 2000 (2001 onwards) for analysis

man_united_seasons = man_united_seasons.loc[man_united_seasons['Season']>2000]

# Reset Index

man_united_seasons.reset_index(drop=True, inplace=True)


# Display the dataframe

man_united_seasons.head()
Out[13]:
Season Standing
0 2001 3
1 2002 1
2 2003 3
3 2004 3
4 2005 2

Data Visualization¶

In [14]:
def imscatter(x, y, image, ax=None, zoom=1):
    if ax is None:
        ax = plt.gca()
    try:
        image = plt.imread(image)
    except TypeError:
        # Likely already an array...
        pass
    im = OffsetImage(image, zoom=zoom)
    x, y = np.atleast_1d(x, y)
    artists = []
    for x0, y0 in zip(x, y):
        ab = AnnotationBbox(im, (x0, y0), xycoords='data', frameon=False)
        artists.append(ax.add_artist(ab))
    ax.update_datalim(np.column_stack([x, y]))
    ax.autoscale()
    return artists
In [24]:
#Defining the X and Y axis
x = man_united_seasons['Season'].values.astype(int)
y = man_united_seasons['Standing']


#Extracting the crown image and plotting it on the targeted years
image_path = get_sample_data('/Users/prati/Downloads/Crown.png')
fig, ax = plt.subplots(figsize=(15, 4))
imscatter(x[5:8], y[5:8], image_path, zoom=0.1, ax=ax)

plt.title('Manchester United Standing in English Premier League')
plt.xlabel('Year')
plt.ylabel('Rank')

#Inverting the y axis so that rank 1 appears at the top
plt.ylim(8,0)
plt.xticks(x,rotation = 45)
ax.plot(x, y)
plt.show()
  • Manchester United were crowned champions in three consecutive seasons, from 2006 to 2009. Over these three seasons, Cristiano Ronaldo scored:
  • 91 goals for the club, and he was United's top scorer in these three seasons.
  • A total of 42 goals in all competitions during the 2007–08 season, his most prolific campaign at Manchester United.

Ronaldo had good communication and dynamics with his teammates at Manchester United. In 2009, he left the club and joined Real Madrid. His transfer had a huge impact on Manchester United's dynamics and style of play, as the club could not replace him with a player of the same caliber. His ability to convert free kicks and penalties into goals was extraordinary. He was in his prime years and attracted the attention of football lovers worldwide. The line graph above shows that in the seasons between 2006 and 2009, Manchester United won the English Premier League title three times in a row, and that the club's league standings declined in the years after his departure.

  • Later, the club could not maintain that consistency as the trend fluctuated in their standings in the league.

  • During the 2013-14 season, they fell to 7th position in the league standings (plotted at 2013, the year in which the season started). In the past 20 years, this particular season was their worst performance.

Objective II - Home ground winning probability¶

The term "home-field advantage" describes the supposed inherent advantage that the side competing at home has during a match. The home team gains from not having to travel and plays in familiar surroundings. On the road, it is a different scenario: travelling can be a significant source of fatigue for the visiting team.

In [25]:
# Load dataset for analysis

premier_league_df = pd.read_csv('Premier_League_data.csv')

# Display output printing first 5 rows

premier_league_df.head()
Out[25]:
Season_End_Year Wk Date Home HomeGoals AwayGoals Away FTR
0 1993 1 1992-08-15 Coventry City 2 1 Middlesbrough H
1 1993 1 1992-08-15 Leeds United 2 1 Wimbledon H
2 1993 1 1992-08-15 Sheffield Utd 2 1 Manchester Utd H
3 1993 1 1992-08-15 Crystal Palace 3 3 Blackburn D
4 1993 1 1992-08-15 Arsenal 2 4 Norwich City A

Data Cleaning¶

The dataset has a "Wk" column which denotes the week of the season. As the week does not play a significant role in determining the home ground winner, we will not use this column in the analysis and therefore drop it.

Drop, Duplicates and Null¶

In [26]:
# Dropping the columns from the dataset

premier_league_df = premier_league_df.drop('Wk', axis=1)

# Check for duplicates in the dataset

duplicates = premier_league_df.duplicated().sum()

print(f'Number of duplicates in the dataset are: {duplicates}')

#identifying the null values

null_values = premier_league_df.isnull().sum().sum()

print(f'Number of null values in the dataset are: {null_values}')
Number of duplicates in the dataset are: 0
Number of null values in the dataset are: 0

Here, we checked for duplicated values in our dataset. When we take the sum of the booleans using duplicated and isnull functions, it was found that there are no duplicates/nulls in the dataset.

Note:¶
  • Overall data available in the dataset is from year 1992 to 2022.
  • Cristiano Ronaldo's first stay at Manchester United: 2003-2009

We are interested in analysing how Ronaldo's performance boosted Manchester United's winning probability when playing at the home ground. We will be comparing the percentages between 2003-2009 and the remaining seasons for the club.

Home ground Analysis at Manchester United¶

To determine whether a team benefits from playing at its home ground, let us calculate the percentage of home ground wins, losses, and draws:

Analysis for Manchester United Without Ronaldo Playing¶

In [27]:
percentage = 100

# Filtering Manchester United home games data with no Ronaldo during his first stay

mutd_home_games = premier_league_df[premier_league_df.Home == 'Manchester Utd']

mutd_no_ronaldo = mutd_home_games[(mutd_home_games['Season_End_Year'] < 2003) | (mutd_home_games['Season_End_Year'] > 2009)]

# percentage win in home games
pct_home_wins = ( len(mutd_no_ronaldo[mutd_no_ronaldo['FTR'] == 'H']) / len(mutd_no_ronaldo) ) * percentage
pct_home_wins

# percentage lost in home games
pct_home_loss = ( len(mutd_no_ronaldo[mutd_no_ronaldo['FTR'] == 'A']) / len(mutd_no_ronaldo) ) * percentage
pct_home_loss

# percentage draw in home games
pct_home_draw = ( len(mutd_no_ronaldo[mutd_no_ronaldo['FTR'] == 'D']) / len(mutd_no_ronaldo) ) * percentage
pct_home_draw


# list for pie chart plot

values_unique = ['Win','Lost','Draw']
values_no_ronaldo = [pct_home_wins,pct_home_loss,pct_home_draw]


result = ['Win','Lost','Draw']
percentages_no_ronaldo = [pct_home_wins,pct_home_loss,pct_home_draw]

# Display the output
print(f'Manchester United win percentage at home games: {pct_home_wins:0.2f}%')
print(f'Manchester United lost percentage at home games: {pct_home_loss:0.2f}% ')
print(f'Manchester United draw percentage at home games: {pct_home_draw:0.2f}%')
Manchester United win percentage at home games: 67.95%
Manchester United lost percentage at home games: 12.19% 
Manchester United draw percentage at home games: 19.86%

Analysis for Manchester United with Ronaldo Playing¶

In [28]:
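# Filtering Manchester United home games data with Ronaldo during his first stay
# Note: Season_End_Year 2003 is the 2002-03 season, which ended before Ronaldo joined in August 2003,
# so this window includes one season in which he did not play for the club.
mutd_with_ronaldo = mutd_home_games[(mutd_home_games['Season_End_Year'] >= 2003) & (mutd_home_games['Season_End_Year'] <= 2009)]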

# percentage win in home games
pct_home_wins = ( len(mutd_with_ronaldo[mutd_with_ronaldo['FTR'] == 'H']) / len(mutd_with_ronaldo) ) * percentage
pct_home_wins

# percentage lost in home games
pct_home_loss = ( len(mutd_with_ronaldo[mutd_with_ronaldo['FTR'] == 'A']) / len(mutd_with_ronaldo) ) * percentage
pct_home_loss

# percentage draw in home games
pct_home_draw = ( len(mutd_with_ronaldo[mutd_with_ronaldo['FTR'] == 'D']) / len(mutd_with_ronaldo) ) * percentage
pct_home_draw

percentages_with_ronaldo = [pct_home_wins,pct_home_loss,pct_home_draw]


# Display the output
print(f'Manchester United win percentage at home games: {pct_home_wins:0.2f}%')
print(f'Manchester United loss percentage at home games: {pct_home_loss:0.2f}% ')
print(f'Manchester United draw percentage at home games: {pct_home_draw:0.2f}%')
Manchester United win percentage at home games: 75.94%
Manchester United loss percentage at home games: 7.52% 
Manchester United draw percentage at home games: 16.54%
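
The two cells above repeat the same calculation for different subsets of matches. A compact alternative, offered only as a sketch, is a small helper built on `value_counts(normalize=True)`. The function name `home_result_percentages` and the toy results below are hypothetical; only the `FTR` column and its codes H (home win), A (away win) and D (draw) come from the dataset.

```python
import pandas as pd


def home_result_percentages(home_games):
    """Return the percentage of home wins (H), losses (A) and draws (D)."""
    shares = home_games['FTR'].value_counts(normalize=True) * 100
    # Fix the order and fill in 0 for any outcome that never occurred
    return shares.reindex(['H', 'A', 'D'], fill_value=0.0)


# Toy data: five made-up home results, for illustration only
toy_games = pd.DataFrame({'FTR': ['H', 'H', 'D', 'A', 'H']})
pct = home_result_percentages(toy_games)

print(f"Win: {pct['H']:0.2f}%  Loss: {pct['A']:0.2f}%  Draw: {pct['D']:0.2f}%")
```

Calling the helper once on `mutd_with_ronaldo` and once on `mutd_no_ronaldo` would give the two lists used for the donut charts, without overwriting the `pct_home_*` variables between cells.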

Data Visualization¶

In [29]:
#Generic function to beautify output of print statement
class color:
   BOLD = '\033[1m'
In [30]:
#Plot Donut chart for Home Ground Wins
values = percentages_with_ronaldo
colours = ['#224676', '#B93114', '#04152B']
labels=['Winning %', 'Losing %', 'Tie %']


trace1 = {'values': values, 
          'labels': labels,
          'marker': {'colors': colours},
          'type': 'pie',
          'hole': 0.6,
          'title': '2003-2009',
          'showlegend': True}

print(color.BOLD + '% of Home Ground Wins with Ronaldo Playing for Manchester United (2003 - 2009)')

pio.show({'data': [trace1]})

values1 = percentages_no_ronaldo

trace2 = {'values': values1, 
          'labels': labels,
          'marker': {'colors': colours},
          'type': 'pie',
          'hole': 0.6,
          'title': '1992-2003, 2009<',
          'showlegend': True}

print(color.BOLD + '% of Home Ground Wins without Ronaldo Playing for Manchester United (Before 2003 and after 2009)')
pio.show({'data': [trace2]})
% of Home Ground Wins with Ronaldo Playing for Manchester United (2003 - 2009)
% of Home Ground Wins without Ronaldo Playing for Manchester United (Before 2003 and after 2009)
  • The above donut graphs illustrate that the winning probability on home ground was higher for Manchester United when Ronaldo was playing at the club (more than 75%). For seasons where Ronaldo did not play, the win probability for the club is approximately 68%.

  • On the other hand, the losing probability also decreased for Manchester United when Ronaldo was at the club: it declined from 12.19% to 7.52%, a good sign for the team.

  • These figures suggest that Ronaldo boosted the club's success in the Premier League: during his stay, the win percentage increased and the loss percentage decreased for matches played at the home ground.

Objective III - Cristiano Ronaldo's Career Goals¶

In [31]:
# Load datasets for analysis

# Creating a dataframe for Ronaldo's career goals

df = pd.read_csv("data.csv")
df_o = pd.read_csv("overall.csv")

Data Cleaning¶

Drop, Duplicates and Null¶

In [32]:
# Count goals per (Club, Year); every row of df is one goal
# reset_index(name='Count') names the count column directly, whatever the pandas version

season_df = df[['Club', 'Year']].value_counts().reset_index(name='Count')

# Add .png to club name for easy parsing while visualizing

season_df['path'] = season_df['Club'] + '.png'


# Check for duplicates in the dataset

duplicates = season_df.duplicated().sum()

print(f'Number of duplicates in the dataframe are: {duplicates}')

#identifying the null values

null_values = season_df.isnull().sum().sum()

print(f'Number of null values in the dataframe are: {null_values}')
Number of duplicates in the dataframe are: 0
Number of null values in the dataframe are: 0

Sort¶

In [33]:
# Sort dataframe according to Year

season_df.sort_values("Year", inplace=True)
season_df.head()
Out[33]:
Club Year Count path
19 Sporting CP 2003 5 Sporting CP.png
18 Manchester United 2004 6 Manchester United.png
17 Manchester United 2005 9 Manchester United.png
16 Manchester United 2006 12 Manchester United.png
15 Manchester United 2007 23 Manchester United.png
In [34]:
# Find all unique values in the columns

pd.DataFrame(df.apply(lambda col: len(col.unique())),columns=["Unique Values Count"])
Out[34]:
Unique Values Count
Season 21
Competition 16
Matchday 52
Year 21
Date 464
Venue 2
Club 4
Opponent 125
Result 51
Playing_Position 6
Minute 106
At_score 35
Type 12
Goal_assist 87
In [35]:
# generic stats description

df.describe(include=['object']).T
Out[35]:
count unique top freq
Season 701 21 14/15 61
Competition 701 16 LaLiga 311
Matchday 701 52 Group Stage 75
Date 701 464 9/12/15 5
Venue 701 2 H 403
Club 701 4 Real Madrid 450
Opponent 701 125 Sevilla FC 27
Result 701 51 3:00 50
Playing_Position 643 5 LW 356
Minute 701 106 90 17
At_score 701 35 1:00 111
Type 686 11 Right-footed shot 251
Goal_assist 459 86 Karim Benzema 44

Cristiano Ronaldo - Team analysis

Data Visualization¶

In [36]:
# Method to return image to be displayed on graph

def getImage(path):
    return OffsetImage(plt.imread(path), zoom=.07, alpha = 1)

#Extract year data to be displayed on x-axis

x_axis = season_df['Year'].values

#Display graph

fig, ax = plt.subplots(figsize=(12, 4), dpi=150)
plt.title('Goals per season')
plt.xlabel('Year')
plt.ylabel('Number of Goals')


plt.xticks(x_axis)
ax.scatter(x_axis, season_df['Count'])
ax.plot(x_axis, season_df['Count'])

print('------------------------')
print('Teams Ronaldo Played for')
print('------------------------')

display(Image('Real Madrid.png', width=30))
print('Real Madrid')

display(Image('Manchester United.png', width=30))
print('Manchester United')

display(Image('Sporting CP.png', width=30))
print('Sporting CP')

display(Image('Juventus FC.png', width=30))
print('Juventus FC')

#Iterate through rows and add image to graph

for index, row in season_df.iterrows():
    ab = AnnotationBbox(getImage(row['path']), (row['Year'], row['Count']), frameon=False)
    ax.add_artist(ab)
------------------------
Teams Ronaldo Played for
------------------------
Real Madrid
Manchester United
Sporting CP
Juventus FC
  • The above graph shows the number of goals Ronaldo has scored in each season of his career.

  • Ronaldo scored more than 40 goals across all club competitions in every year from 2011 to 2017, all of them with Real Madrid. We have seen that he helped in Manchester United's success based on its standings and its winning probability at home games. The trend of his career goals further supports his impact and contribution to a football team.

Interesting Fact¶

One of the most memorable moments in CR7's international career is his hat-trick against Spain at the 2018 World Cup, which rescued a point for Portugal. Enjoy the video just below!

https://www.youtube.com/shorts/dGwpkZ174ng?feature=share

Objective IV - Ronaldo Goals Scoring patterns - La liga¶

Let us look for patterns in Ronaldo's style of play: from where on the football pitch (center, near the box, or outside the box) has he scored most of his goals? We will consider the 2017-18 season, when he played for Real Madrid:

Data

Data required for the analysis:

  • Spain_matches.json: LaLiga season 2017-18 matches. We will gain insights based on this particular season.
  • spain_events.json: event name, positions from which goals were scored, event time.

In [37]:
#Read La liga - Spain matches data

with open('Spain_matches.json') as json_file:
    laliga_matches_data = json.load(json_file)
In [38]:
#Assign team id to variable
real_madrid_team_id = '675'

#Get Real madrid teams match data
real_madrid_matches  = [data for data in laliga_matches_data if real_madrid_team_id in data['teamsData'].keys()]
In [39]:
#Store Real madrid team Data in Dataframe
real_madrid_matches_df = pd.DataFrame(real_madrid_matches)

real_madrid_matches_df.head()
Out[39]:
status roundId gameweek teamsData seasonId dateutc winner venue wyId label date referees duration competitionId
0 Played 4406122 38 {'675': {'scoreET': 0, 'coachId': 275283, 'sid... 181144 2018-05-19 18:45:00 0 Estadio de la Cer\u00e1mica 2565927 Villarreal - Real Madrid, 2 - 2 May 19, 2018 at 8:45:00 PM GMT+2 [{'refereeId': 395085, 'role': 'referee'}, {'r... Regular 795
1 Played 4406122 37 {'692': {'scoreET': 0, 'coachId': 3880, 'side'... 181144 2018-05-12 18:45:00 675 Estadio Santiago Bernab\u00e9u 2565912 Real Madrid - Celta de Vigo, 6 - 0 May 12, 2018 at 8:45:00 PM GMT+2 [{'refereeId': 398923, 'role': 'referee'}, {'r... Regular 795
2 Played 4406122 34 {'675': {'scoreET': 0, 'coachId': 275283, 'sid... 181144 2018-05-09 19:30:00 680 Estadio Ram\u00f3n S\u00e1nchez Pizju\u00e1n 2565882 Sevilla - Real Madrid, 3 - 2 May 9, 2018 at 9:30:00 PM GMT+2 [{'refereeId': 384946, 'role': 'referee'}, {'r... Regular 795
3 Played 4406122 36 {'675': {'scoreET': 0, 'coachId': 275283, 'sid... 181144 2018-05-06 18:45:00 0 Camp Nou 2565907 Barcelona - Real Madrid, 2 - 2 May 6, 2018 at 8:45:00 PM GMT+2 [{'refereeId': 378950, 'role': 'referee'}, {'r... Regular 795
4 Played 4406122 35 {'675': {'scoreET': 0, 'coachId': 275283, 'sid... 181144 2018-04-28 16:30:00 675 Estadio Santiago Bernab\u00e9u 2565891 Real Madrid - Legan\u00e9s, 2 - 1 April 28, 2018 at 6:30:00 PM GMT+2 [{'refereeId': 385473, 'role': 'referee'}, {'r... Regular 795
In [40]:
#Read Event data from spain_events json file

with open('spain_events.json') as json_file:
    laliga_events_data = json.load(json_file)
In [41]:
#Store La liga events data in Dataframe

laliga_events_df = pd.DataFrame(laliga_events_data)

laliga_events_df.head()
Out[41]:
eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id
0 8 Simple pass [{'id': 1801}] 3542 [{'y': 61, 'x': 37}, {'y': 50, 'x': 50}] 2565548 Pass 682 1H 2.994582 85 180864419
1 8 Simple pass [{'id': 1801}] 274435 [{'y': 50, 'x': 50}, {'y': 30, 'x': 45}] 2565548 Pass 682 1H 3.137020 85 180864418
2 8 Simple pass [{'id': 1801}] 364860 [{'y': 30, 'x': 45}, {'y': 12, 'x': 38}] 2565548 Pass 682 1H 6.709668 85 180864420
3 8 Simple pass [{'id': 1801}] 3534 [{'y': 12, 'x': 38}, {'y': 69, 'x': 32}] 2565548 Pass 682 1H 8.805497 85 180864421
4 8 Simple pass [{'id': 1801}] 3695 [{'y': 69, 'x': 32}, {'y': 37, 'x': 31}] 2565548 Pass 682 1H 14.047492 85 180864422

Each player has a Player ID associated with it. Let us create a new dataframe for Ronaldo whose ID is 3322 for our analysis:

In [42]:
#Get event records for Ronaldo (where Player ID = 3322)
#Taking a copy so that new columns can be added without a SettingWithCopyWarning

ronaldo_events_data_df = laliga_events_df.loc[laliga_events_df['playerId'] == 3322].copy()
Tag Information From Dataframe (Assists, Goals)

101: Goal
301: Assist

In [43]:
#Method for adding additional columns to distinguish goal, assists, left foot/right foot goals

def add_columns(tags, tag_id):
    return tag_id in [tag['id'] for tag in tags]
In [44]:
#Distinguish goals and assists - store boolean value

ronaldo_events_data_df['Goal'] = ronaldo_events_data_df['tags'].apply(lambda x: add_columns(x, 101))
ronaldo_events_data_df['Assists'] = ronaldo_events_data_df['tags'].apply(lambda x: add_columns(x, 301))
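
To make the tag lookup concrete, here is a small self-contained sketch on made-up events. The helper `has_tag` and the constants `GOAL_TAG` and `ASSIST_TAG` are names introduced for this illustration; the tag ids 101 and 301 are the ones listed above, and the toy `tags` lists are invented.

```python
import pandas as pd

GOAL_TAG = 101    # tag id for a goal, as listed above
ASSIST_TAG = 301  # tag id for an assist, as listed above


def has_tag(tags, tag_id):
    """Return True if any tag dictionary in the list carries the given id."""
    return any(tag['id'] == tag_id for tag in tags)


# Toy events: each row holds a list of {'id': ...} dictionaries, like the 'tags' column
events = pd.DataFrame({
    'tags': [[{'id': 101}, {'id': 1801}], [{'id': 301}], []],
})

events['Goal'] = events['tags'].apply(lambda tags: has_tag(tags, GOAL_TAG))
events['Assists'] = events['tags'].apply(lambda tags: has_tag(tags, ASSIST_TAG))

# Summing the boolean columns counts the tagged events
print(events[['Goal', 'Assists']].sum())
```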
In [45]:
# Display the output

ronaldo_events_data_df.head()
Out[45]:
eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id Goal Assists
76412 1 Ground attacking duel [{'id': 501}, {'id': 703}, {'id': 1801}] 3322 [{'y': 26, 'x': 96}, {'y': 27, 'x': 91}] 2565596 Duel 675 1H 28.108732 11 189337977 False False
76414 10 Shot [{'id': 402}, {'id': 2101}, {'id': 201}, {'id'... 3322 [{'y': 27, 'x': 91}, {'y': 0, 'x': 0}] 2565596 Shot 675 1H 31.052085 100 189337978 False False
76457 8 Simple pass [{'id': 1801}] 3322 [{'y': 53, 'x': 68}, {'y': 67, 'x': 53}] 2565596 Pass 675 1H 146.902499 85 189338004 False False
76589 10 Shot [{'id': 402}, {'id': 201}, {'id': 1201}, {'id'... 3322 [{'y': 48, 'x': 96}, {'y': 0, 'x': 0}] 2565596 Shot 675 1H 548.744061 100 189338889 False False
76654 1 Air duel [{'id': 702}, {'id': 1801}] 3322 [{'y': 84, 'x': 62}, {'y': 81, 'x': 42}] 2565596 Duel 675 1H 713.899672 10 189338224 False False
In [46]:
#Adding match information to the events DataFrame
ronaldo_events_data_df = pd.merge(ronaldo_events_data_df, real_madrid_matches_df, left_on='matchId', right_on='wyId', how="left")
In [47]:
ronaldo_events_data_df.head(2)
Out[47]:
eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec ... seasonId dateutc winner venue wyId label date referees duration competitionId
0 1 Ground attacking duel [{'id': 501}, {'id': 703}, {'id': 1801}] 3322 [{'y': 26, 'x': 96}, {'y': 27, 'x': 91}] 2565596 Duel 675 1H 28.108732 ... 181144 2017-09-20 20:00:00 684 Estadio Santiago Bernab\u00e9u 2565596 Real Madrid - Real Betis, 0 - 1 September 20, 2017 at 10:00:00 PM GMT+2 [{'refereeId': 384946, 'role': 'referee'}, {'r... Regular 795
1 10 Shot [{'id': 402}, {'id': 2101}, {'id': 201}, {'id'... 3322 [{'y': 27, 'x': 91}, {'y': 0, 'x': 0}] 2565596 Shot 675 1H 31.052085 ... 181144 2017-09-20 20:00:00 684 Estadio Santiago Bernab\u00e9u 2565596 Real Madrid - Real Betis, 0 - 1 September 20, 2017 at 10:00:00 PM GMT+2 [{'refereeId': 384946, 'role': 'referee'}, {'r... Regular 795

2 rows × 28 columns
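
The merge above joins each event to its match through two differently named keys. The sketch below shows the same pattern on toy frames. The `validate` and `indicator` arguments are optional additions that were not used in the notebook, included here as one way to confirm that every event found exactly one match.

```python
import pandas as pd

# Toy frames with made-up ids and labels, for illustration only
events = pd.DataFrame({
    'matchId': [1, 1, 2],
    'eventName': ['Shot', 'Pass', 'Shot'],
})
matches = pd.DataFrame({
    'wyId': [1, 2],
    'label': ['A - B, 1 - 0', 'C - A, 2 - 2'],
})

merged = pd.merge(
    events, matches,
    left_on='matchId', right_on='wyId',
    how='left',               # keep every event, even if no match row exists
    validate='many_to_one',   # raise an error if a match id is duplicated on the right
    indicator=True,           # add a '_merge' column describing where each row came from
)

# Events without match information would be flagged as 'left_only'
unmatched = int((merged['_merge'] == 'left_only').sum())
print(merged[['matchId', 'eventName', 'label']])
print(f'Events without match information: {unmatched}')
```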

In [48]:
#Calculate number of goals scored
goals = [ronaldo_events_data_df['Goal'].sum()]

#Calculate number of assists
assists = [ronaldo_events_data_df['Assists'].sum()]

#Calculate number of shots attempted
shots_attempted = [ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Shot'].count()['eventName']]


statistics = pd.DataFrame([goals, assists, shots_attempted], 
                        columns=['Ronaldo'], 
                        index=['Goal', 'Assists', 'Shots'])

print("-----Ronaldo's Goals, Assists and Shots Attempted-----")
statistics.head()
-----Ronaldo's Goals, Assists and Shots Attempted-----
Out[48]:
Ronaldo
Goal 26
Assists 5
Shots 151
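
The two counts above can be combined into a rough goals-per-shot ratio. This is only a sketch: it assumes every goal came from an event counted as a shot, which the notebook does not verify (penalties and free kicks may be recorded under a different event type). The variable names are illustrative.

```python
n_goals = 26    # 'Goal' row of the statistics table
n_shots = 151   # 'Shots' row of the statistics table

goals_per_shot = n_goals / n_shots
print(f"{goals_per_shot:.1%}")
```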

Data Visualization

In [49]:
#Function to draw the football pitch
#Initialize football pitch variables

width = 700
height = 350
width_pitch = 104
height_pitch = 68
#Named pitch_color so that the `color` class defined in a later cell cannot shadow it
pitch_color = 'green'
line_color = 'white'
grey_color = '#808080'
 

def draw_football_pitch():
    
    #Create figure for plotting
    pitch = figure(width = width, height = height, toolbar_location="right")
    
    #Draw outline for empty pitch - football ground
    pitch.rect(x=width_pitch/2., y=height_pitch/2., width=width_pitch, height=height_pitch, fill_color=pitch_color, line_width=2, line_color=line_color)
    
    #Draw left penalty arc (mostly covered by the penalty area drawn next)
    pitch.circle(16.5, height_pitch/2., size=50, fill_color=pitch_color, line_width=2, line_color=line_color)
    
    #Draw left penalty area (bigger rectangle)
    pitch.rect(x=16.5/2., y=height_pitch/2., width=16.5, height=40.3, fill_color=pitch_color, line_width=2, line_color=line_color)
    
    #Draw left goal area (smaller rectangle)
    pitch.rect(x=5.5/2., y=height_pitch/2., width=5.5, height=18.3, fill_color=pitch_color, line_width=2, line_color=line_color)
    
    #Draw left goal post (filled white, same as the right one)
    pitch.rect(x=0, y=height_pitch/2., width=0.5, height=7.3, fill_color=line_color, line_width=2, line_color=line_color)
    
    
    #Draw left penalty spot
    pitch.circle(11, height_pitch/2., size=2, fill_color=line_color, line_width=2, line_color=line_color)
    
    #Draw right penalty arc
    pitch.circle((width_pitch-16.5), height_pitch/2., size=50, fill_color=pitch_color, line_width=2, line_color=line_color)
    
    #Draw right penalty area (bigger rectangle)
    pitch.rect(x=width_pitch-(16.5/2.), y=height_pitch/2., width=16.5, height=40.3, fill_color=pitch_color, line_width=2, line_color=line_color)
    
    #Draw right goal area (smaller rectangle)
    pitch.rect(x=width_pitch-(5.5/2.), y=height_pitch/2., width=5.5, height=18.3, fill_color=pitch_color, line_width=2, line_color=line_color)
    #Draw right goal post
    pitch.rect(x=width_pitch, y=height_pitch/2., width=0.5, height=7.3, fill_color=line_color, line_width=2, line_color=line_color)
    #Draw right penalty spot
    pitch.circle((width_pitch-11), height_pitch/2., size=2, fill_color=line_color, line_width=2, line_color=line_color)
    
    #Draw centre circle, centre spot and halfway line
    pitch.circle(width_pitch/2.0, y=height_pitch/2.0, size=100, fill_color=pitch_color, line_width=2, line_color=line_color)
    pitch.circle(width_pitch/2.0, y=height_pitch/2.0, size=2, fill_color=line_color, line_width=2, line_color=line_color)
    pitch.line([width_pitch/2.0, width_pitch/2.0], [0, height_pitch], line_width=2, line_color=line_color)
    
    return pitch
    
    
In [50]:
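#Plot the start positions of the given events (shots or goals) on the pitch
def plot_position_data(player_data, action_name, color_of_plot):

    #Positions are percentages (0-100); scale them to the pitch size used in draw_football_pitch
    x_axis = [(positions[0]['x']*width_pitch)/100. for positions in player_data]
    y_axis = [(positions[0]['y']*height_pitch)/100. for positions in player_data]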

    pitch = draw_football_pitch()
    pitch.circle(x_axis, y_axis, fill_color=color_of_plot, line_width=1, line_color="blue", fill_alpha=0.2, size=8)
    player_statistics = bokeh.models.Label(x=90, y=280,x_units='screen', y_units='screen', text=str(len(x_axis)) + " " + action_name, text_font_size= '20px', render_mode='css', text_color = 'white')
    pitch.add_layout(player_statistics)

    return pitch
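
The `positions` column stores coordinates as percentages of the pitch (0 to 100 on each axis), so they have to be rescaled before they are drawn on a pitch measured in metres. A minimal sketch of that conversion, assuming a 104 × 68 pitch as in `draw_football_pitch`; the helper name `to_pitch_metres` is hypothetical:

```python
def to_pitch_metres(position, pitch_length=104, pitch_width=68):
    """Convert one {'x': pct, 'y': pct} dict to (x, y) in metres."""
    x = position["x"] * pitch_length / 100.0
    y = position["y"] * pitch_width / 100.0
    return x, y

# Start position of a shot as it appears in the events table
start = {"y": 48, "x": 96}
print(to_pitch_metres(start))
```

Scaling by the same length and width that were used to draw the pitch keeps the plotted points aligned with the pitch markings.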
In [51]:
#Extract the positions of the events in which a goal was scored
ronaldo_goals = ronaldo_events_data_df[ronaldo_events_data_df['Goal'] == True]['positions']
In [52]:
#Extract positions from where shots were attempted
shots_data = ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Shot']
ronaldo_shots = shots_data['positions']
In [53]:
#Plot ronaldo goals
plot_goals = plot_position_data(ronaldo_goals, 'Goals', 'grey')
In [54]:
#Plot shots
plot_shots = plot_position_data(ronaldo_shots, 'Shots', 'grey')
In [55]:
#Helper class holding the ANSI escape code used to print bold text
class color:
   BOLD = '\033[1m'
In [56]:
#Visualize shots
print(color.BOLD + 'Positions from which Ronaldo attempted a Goal')
print(color.BOLD + '------------------------------------------')
show(plot_shots)
Positions from which Ronaldo attempted a Goal
------------------------------------------

From the visualization of the football pitch, we can observe that Ronaldo attempted 151 shots. The positions from which he attempted a goal are plotted as blue circles.

  • In the 2017-18 season, Cristiano Ronaldo attempted 151 shots against opposing teams.
  • The shot attempts are distributed all over the box.

Let us now analyze how many of these attempted shots were converted into goals:

In [57]:
#Visualize Goals
display(Image('Ronaldo.jpeg', width=100))

print(color.BOLD + 'Positions from which Ronaldo scored a Goal')
print(color.BOLD + '------------------------------------------')
show(plot_goals)
Positions from which Ronaldo scored a Goal
------------------------------------------
  • From the above plots, we can see that all of Ronaldo's goals in the 2017-18 season for Real Madrid were scored from inside the box.

  • He converted 26 of his attempted shots into goals, which shows how lethal he is once he is inside the penalty box.
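
The "inside the box" observation can also be checked numerically rather than by eye. A minimal sketch, assuming positions already converted to metres on a 104 × 68 pitch with the attacked goal at x = 104, and using the same 16.5 m × 40.3 m penalty-area dimensions as `draw_football_pitch`; the function name and the sample points are illustrative:

```python
PITCH_LENGTH, PITCH_WIDTH = 104, 68
BOX_DEPTH, BOX_WIDTH = 16.5, 40.3

def in_penalty_box(x, y):
    """True if (x, y) in metres lies inside the attacked penalty area."""
    y_min = (PITCH_WIDTH - BOX_WIDTH) / 2
    y_max = (PITCH_WIDTH + BOX_WIDTH) / 2
    return x >= PITCH_LENGTH - BOX_DEPTH and y_min <= y <= y_max

# Made-up points: central and close to goal, outside the box, wide of the box
sample_points = [(99.8, 32.6), (80.0, 34.0), (95.0, 5.0)]
print([in_penalty_box(x, y) for x, y in sample_points])
```

Applying such a check to every goal position would turn the visual impression into a count.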

Conclusion¶

Our data supports the view that Cristiano Ronaldo is one of the best players in the history of football. He has been a major part of the success of the clubs he has played for.

Objective I¶

  • From the line graph, we can see that Manchester United's performance peaked while Ronaldo played for them: they were champions in three consecutive seasons.
  • Their Premier League finishes became inconsistent after he left the club. They were crowned champions only twice (2010-11 and 2012-13), and the decline is visible in the years that followed.
This suggests that he played a huge role in the club's success.¶

Objective II¶

  • We try to quantify the inference from the above analysis. The team won 76% of its home matches when Ronaldo played and 67% of those in which he did not.
  • This gives us a strong hint that he had a positive impact on the other players, improving the chemistry of the team and, with it, the team's performance.
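
A comparison of this kind can be computed with a single group-by. The sketch below uses a small made-up table, so its percentages are not the 76% and 67% reported above; the column names `ronaldo` and `home_win` are hypothetical:

```python
import pandas as pd

# Made-up home results, one row per home match
home_matches = pd.DataFrame({
    "ronaldo": ["played", "played", "played", "played", "absent", "absent", "absent"],
    "home_win": [True, True, True, False, True, False, True],
})

# The mean of a boolean column is the share of True values
win_rate = home_matches.groupby("ronaldo")["home_win"].mean() * 100
print(win_rate.round(1))
```

With only a handful of matches per group, a gap of a few percentage points can easily arise by chance, so the sample sizes should be reported alongside the rates.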

Objective III¶

  • More than 50% of Cristiano's goals were scored for Real Madrid. This is to be expected: he joined Real Madrid from Manchester United on July 1st, 2009, and left for Juventus on July 10th, 2018. He was with Real Madrid for nine years and nine days, longer than with any other club.
  • He scored more than 50 goals in each of 6 consecutive seasons.
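
The length of the Real Madrid spell can be checked directly from the two dates given above. A minimal sketch using only the standard library; the variable names are illustrative:

```python
from datetime import date

joined = date(2009, 7, 1)   # signing date stated above
left = date(2018, 7, 10)    # departure date stated above

# Count whole years first, then the days left over after the last anniversary
full_years = left.year - joined.year
if (left.month, left.day) < (joined.month, joined.day):
    full_years -= 1
anniversary = joined.replace(year=joined.year + full_years)
extra_days = (left - anniversary).days
print(full_years, extra_days)
```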

Objective IV¶

  • Having analysed his career trajectory, we drill down into one of his best seasons for Real Madrid and look for patterns in how he scores his goals.

Recommendation¶

  • Our analysis indicates that he has been a major factor in the success of every club he has played for. Manchester United has had a difficult decade since Ronaldo's departure.
  • We suggest recruiting a player who shows similar skills early in his career, as the club lacks good strikers. In our view, the best solution is to sign a young winger whose stats match those of Ronaldo in his early years.

Citations:¶

  1. Kaggle - https://www.kaggle.com/

  2. Research Paper - A public data set of spatio-temporal match events in soccer competitions - https://www.nature.com/articles/s41597-019-0247-7

  3. Wikipedia - https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons

  4. Images - https://www.google.com/